{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Named Entity Extraction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The named entity extraction task aims to extract phrases from plain text that correspond to entities.\n", "Polyglot recognizes 3 categories of entities:\n", "\n", "- Locations (Tag: `I-LOC`): cities, countries, regions, continents, neighborhoods, administrative divisions ...\n", "- Organizations (Tag: `I-ORG`): sports teams, newspapers, banks, universities, schools, non-profits, companies, ...\n", "- Persons (Tag: `I-PER`): politicians, scientists, artists, athletes ..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Languages Coverage" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The models were trained on datasets extracted automatically from Wikipedia.\n", "Polyglot currently supports 40 major languages." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1. Polish 2. Turkish 3. Russian \n", " 4. Indonesian 5. Czech 6. Arabic \n", " 7. Korean 8. Catalan; Valencian 9. Italian \n", " 10. Thai 11. Romanian, Moldavian, ... 12. Tagalog \n", " 13. Danish 14. Finnish 15. German \n", " 16. Persian 17. Dutch 18. Chinese \n", " 19. French 20. Portuguese 21. Slovak \n", " 22. Hebrew (modern) 23. Malay 24. Slovene \n", " 25. Bulgarian 26. Hindi 27. Japanese \n", " 28. Hungarian 29. Croatian 30. Ukrainian \n", " 31. Serbian 32. Lithuanian 33. Norwegian \n", " 34. Latvian 35. Swedish 36. English \n", " 37. Greek, Modern 38. Spanish; Castilian 39. Vietnamese \n", " 40. 
Estonian \n" ] } ], "source": [ "from polyglot.downloader import downloader\n", "print(downloader.supported_languages_table(\"ner2\", 3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Download Necessary Models" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[polyglot_data] Downloading package embeddings2.en to\n", "[polyglot_data] /home/rmyeid/polyglot_data...\n", "[polyglot_data] Package embeddings2.en is already up-to-date!\n", "[polyglot_data] Downloading package ner2.en to\n", "[polyglot_data] /home/rmyeid/polyglot_data...\n", "[polyglot_data] Package ner2.en is already up-to-date!\n" ] } ], "source": [ "%%bash\n", "polyglot download embeddings2.en ner2.en" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Entities inside a text object or a sentence are represented as chunks.\n", "Each chunk identifies the start and the end indices of the word subsequence within the text." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from polyglot.text import Text" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "blob = \"\"\"The Israeli Prime Minister Benjamin Netanyahu has warned that Iran poses a \"threat to the entire world\".\"\"\"\n", "text = Text(blob)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can query all entities mentioned in a text." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[I-ORG([u'Israeli']), I-PER([u'Benjamin', u'Netanyahu']), I-LOC([u'Iran'])]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text.entities" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or, we can query entities per sentence." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The Israeli Prime Minister Benjamin Netanyahu has warned that Iran poses a \"threat to the entire world\". \n", "\n", "I-ORG [u'Israeli']\n", "I-PER [u'Benjamin', u'Netanyahu']\n", "I-LOC [u'Iran']\n" ] } ], "source": [ "for sent in text.sentences:\n", "  print(sent, \"\\n\")\n", "  for entity in sent.entities:\n", "    print(entity.tag, entity)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By inspecting the second entity, `Benjamin Netanyahu`, more closely, we can locate its position within the sentence." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "WordList([u'Benjamin', u'Netanyahu'])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "benjamin = sent.entities[1]\n", "sent.words[benjamin.start:benjamin.end]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Command Line Interface" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ", O \r\n", "which O \r\n", "was O \r\n", "equalled O \r\n", "five O \r\n", "days O \r\n", "ago O \r\n", "by O \r\n", "South I-LOC\r\n", "Africa I-LOC\r\n", "in O \r\n", "their O \r\n", "victory O \r\n", "over O \r\n", "West I-ORG\r\n", "Indies I-ORG\r\n", "in O \r\n", "Sydney I-LOC\r\n", ". O \r\n", "\r\n" ] } ], "source": [ "!polyglot --lang en tokenize --input testdata/cricket.txt | polyglot --lang en ner | tail -n 20" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Demo" ] }, { "cell_type": "raw", "metadata": {}, "source": [ ".. raw:: html\n", " \n", " \n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Citation\n", "\n", "This work is a direct implementation of the research described in the [Polyglot-NER: Massive Multilingual Named Entity Recognition](https://sites.google.com/site/rmyeid/papers/polyglot-ner.pdf?attredirects=0&d=1) paper.\n", "The author of this library strongly encourages you to cite the following paper if you use this software." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "```\n", "@article{polyglotner,\n", " author = {Al-Rfou, Rami and Kulkarni, Vivek and Perozzi, Bryan and Skiena, Steven},\n", " title = {{Polyglot-NER}: Massive Multilingual Named Entity Recognition},\n", " journal = {{Proceedings of the 2015 {SIAM} International Conference on Data Mining, Vancouver, British Columbia, Canada, April 30 - May 2, 2015}},\n", " month = {April},\n", " year = {2015},\n", " publisher = {SIAM}\n", "}\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References\n", "\n", "- [Polyglot-NER project page.](https://bit.ly/polyglot-ner)\n", "- [Wikipedia on NER](http://en.wikipedia.org/wiki/Named-entity_recognition)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }